Congress Demographics

Data description and project goals

This R Markdown serves as a reproducible pipeline to do replicate 2 figures from the 538 article on rising congressional ages Congress Today Is Older Than It’s Ever Been Data are helpfully organized into a tidy, longform dataset, where rows correspond to individual congresspersons 2-year period in each unique congress, and columns include information on age, party, congress number, state, etc. The two figures to be reproduced are:

1: The House and Senate are older than ever before

2: Congress is never dominated by generations as old as boomers

538 Readme data information

This directory contains various demographic data about the United States Senate and House of Representatives over time. It’s been used in the following FiveThirtyEight articles:

data_aging_congress.csv contains information about the age of every member of the U.S. Senate and House from the 66th Congress (1919-1921) to the 118th Congress (2023-2025). Data is as of March 29, 2023, and is based on all voting members who served in either the Senate or House in each Congress. The data excludes delegates or resident commissioners from non-states. Any member who served in both chambers in the same Congress was assigned to the chamber in which they cast more votes. We began with the 66th Congress because it was the first Congress in which all senators had been directly elected, rather than elected by state legislatures, following the ratification of the 17th Amendment in 1913.

Header Description Source(s)
congress The number of the Congress that this member’s row refers to. For example, 118 indicates the member served in the 118th Congress (2023-2025). Biographical Directory of the United States Congress; VoteView.com
start_date First day of a Congress. For the 66th Congress to the 73rd Congress, this was March 4. With the ratification of the 20th Amendment, Congress’s start date shifted to Jan. 3 for the 74th Congress to present. U.S. House of Representatives
chamber The chamber a member of Congress sat in: Senate or House. Any member who served in both chambers in the same Congress — e.g., a sitting representative who was later appointed to the Senate — was assigned to the chamber in which they cast more votes. Biographical Directory of the United States Congress; VoteView.com
state_abbrev The two-letter postal abbreviation for the state a member represented. Biographical Directory of the United States Congress; VoteView.com
party_code A code that indicates a member’s party, based on the system used by the Inter-university Consortium for Political and Social Research. The most common values will be 100 for Democrats, 200 for Republicans and 328 for independents. See VoteView.com’s full list for other party codes. If a member switched parties amid a Congress, they are listed with the party they identified with during the majority of their votes. VoteView.com
bioname Full name of member of Congress. Biographical Directory of the United States Congress; VoteView.com
bioguide_id Code used by the Biographical Directory of the United States Congress to uniquely identify each member. Biographical Directory of the United States Congress; VoteView.com
birthday Date of birth for a member. UnitedStates GitHub; Biographical Directory of the United States Congress
cmltv_cong The cumulative number of Congresses a member has or had served in (inclusive of listed congress), regardless of whether the member was in the Senate or House. E.g. 1 indicates it’s a member’s first Congress. Biographical Directory of the United States Congress; VoteView.com
cmltv_chamber The cumulative number of Congresses a member has or had served in a chamber (inclusive of listed congress). E.g. a senator with a 1 indicates it’s the senator’s first Congress in the Senate, regardless of whether they had served in the House before. Biographical Directory of the United States Congress; VoteView.com
age_days Age in days, calculated as start_date minus birthday.
age_years Age in years, calculated by dividing age_days by 365.25.
generation Generation the member belonged to, based on the year of birth. Generations in the data are defined as follows: Gilded (1822-1842), Progressive (1843-1859), Missionary (1860-1882), Lost (1883-1900), Greatest (1901-1927), Silent (1928-1945), baby boomer (1946-1964), Generation X (1965-1980), millennial (1981-1996), Generation Z (1997-2012).

Note: Baby boomers are listed as Boomers, Generation X as Gen X, millennials as Millennial and Generation Z as Gen Z.
Pew Research Center for definitions of Greatest Generation to Generation Z; Strauss and Howe (1991) for definitions for Gilded to Lost generations.

Load in data

  1. Set up directory structure
  2. Read in Congress ages data or download and write to disk if not present
# `here` package will automatically set up the 
project_directory <- here()
ages_filelocation <- file.path(project_directory, "Data/ages.csv")
ages_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-demographics/data_aging_congress.csv"

# Create Data directory if necessary
if(!dir.exists(file.path(project_directory, "Data/"))){
  dir.create(file.path(project_directory, "Data/"))
}

# Fread can handle URLs automatically
if(file.exists(ages_filelocation)){
  print("Reading in ages data from disk")
  ages <- fread(ages_filelocation)
} else{
  print("Ages data not on disk: Loading in ages data from 538 Github:") 
  print(ages_url)
  ages <- fread(ages_url)
  fwrite(ages, ages_filelocation)
}
## [1] "Reading in ages data from disk"

Figure 1: The House and Senate are older than ever before

This plot shows median age by year, with the senate and house as separate lines. The plot is interactive so plotly will be used to imitate this. First, data wrangling will be performed to get cast the longform ages data we have into a year-aggregated dataset.

# Use a dcast/melt chain to get median age for each chamber by year
ages_year <- dcast(ages[, .(year_ = year(start_date), age_years, chamber)],
                  year_ ~ chamber, fun.aggregate = median, value.var = "age_years") %>%
  melt(id.vars = "year_", value.vars = c("House", "Senate"),
       variable.name = "chamber", value.name = "age_years")

# Ordered properly for the legend which lists "Senate" first even though it's later alphabetically
ages_year[, chamber := factor(chamber, levels = c("Senate", "House"))]
p_fig1 <- ggplot(ages_year, aes(x = year_, y = age_years, color = chamber)) +
  geom_step(size = 1.25, aes(x = year_, y = age_years, color = chamber)) + 
  scale_colour_manual(values = c("House" = "darkgreen", "Senate" = "purple")) +
  labs(title = "The House and Senate are older than ever before",
       subtitle = "Median age of the U.S. Senate and U.S. House by Congress, 1919 to 2023",
       caption = "Data is based on all members who served in either the Senate or House in each Congress, which is notated by the year in\nwhich it was seated. Any member who served in both chambers in the same Congress was assigned to the chamber in which\nthey cast more votes.\n\nSOURCES: BIOGRAPHICAL DIRECTORY OF THE U.S. CONGRESS, U.S. HOUSE OF\nREPRESENTATIVES, U.S. SENATE, UNITEDSTATES GITHUB, VOTEVIEW.COM\nFiveThirtyEight",
       color = "",
       x = "",
       y = "") +
  scale_x_continuous(breaks = seq(1920, 2020, by = 10)) +
  scale_y_continuous(breaks = seq(45, 65, by = 5)) +
  geom_hline(yintercept = seq(45,65,by = 5), alpha = 0.3) +
  theme_minimal() +
  theme(legend.title = element_blank(), legend.position = c(0.08,0.98),
        legend.direction = "horizontal", plot.title = element_text(face = "bold"),
        plot.caption = element_text(size = 8, hjust = 0, face = "italic", colour = "gray40"),
        axis.text = element_text(colour = "gray40")) 
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.

Static version

This ‘static’ plot is more accurate visually to the 538 figure but it is not interactive.

print(p_fig1)

Interactive version

This plot uses plotly to create an interactive figure, but some of the js code will not handle certain aesthetic features (such as horizontal legend layout) so it is not quite accurate to the 538 layout

# For some reason the `tooltip` ordering argument is not working as it lists on the vignette
ggplotly(p_fig1, tooltip = c("color", "x", "y"))
## Warning: plotly.js does not (yet) support horizontal legend items 
## You can track progress here: 
## https://github.com/plotly/plotly.js/issues/53

Figure 2: Congress is never dominated by generations as old as boomers

This figure is an area plot, which again has year on the x-axis. However, this plot differs as it shades the y axis into categories of generation (boomer, millenial, etc.) according the the percentage of that given year’s congress that belongs to each generation.

#Similar dcast/melt chain to above...except with number of each generation
ages_generations <- dcast(ages[, .(year_ = year(start_date), generation)],
                  year_ ~ generation, fun.aggregate = length) %>%
  melt(id.vars = "year_", value.name = "num_seats", variable.name = "generation")

# Get the percentage of seats held be each generation stratified by year
ages_generations[, pct_seats := 100*num_seats/sum(num_seats), by = year_]

# Formatting the generations to be capitalized and in correct order for legend
ages_generations[, generation := factor(toupper(generation), levels = c(
 'GEN Z',
 'MILLENNIAL',
 'GEN X',
 'BOOMERS',
 'SILENT',
 'GREATEST',
 'LOST',
 'MISSIONARY',
 'PROGRESSIVE',
 'GILDED'
))]
# Similar colors to the plot, although generally there are higher RBG (lighter) 
# I think for the figure
cols <- c("red", "lightblue", "purple", "orange", "yellow",
          "lightgreen", "grey15", "magenta", "grey80", "grey90" )

# Geom_Area is the appropriate command to create this 'shaded' plot
p_fig2 <- ggplot(ages_generations) + 
  geom_area(aes(x = year_, y = pct_seats, fill = generation), alpha = 0.8) +
  labs(title = "Baby boomers are the biggest generation in Congress today",
       subtitle = "Share of members in Congress from each generation, 1919 to 2023",
       caption = "Birth years for the Greatest Generation to Generation Z are based on Pew Research Center definitions. For earlier\ngenerations, definitions are based on Strauss and Howe (1991). They are: Gilded (1822-1842), Progressive (1843-1859),\nMissionary (1860-1882), Lost (1883-1900), Greatest (1901-1927), Silent (1928-1945), baby boomer (1946-1964), Generation X\n(1965-1980), millennial (1981-1996), Generation Z (1997-2012).",
       color = "",
       x = "",
       y = "") +
  scale_x_continuous(breaks = seq(1920, 2020, by = 10)) +
  scale_y_continuous(breaks = seq(0, 100, by = 20)) +
  scale_fill_manual(labels = ages_generations[, levels(generation)], values = cols) +
  geom_hline(yintercept = seq(0, 100, by = 20), alpha = 0.3) +
  theme_minimal() +
  theme(legend.title = element_blank(), legend.position = c(0.35,0.98),
        legend.direction = "horizontal", plot.title = element_text(face = "bold"),
        plot.caption = element_text(size = 8, hjust = 0, face = "italic", colour = "gray40"),
        axis.text = element_text(colour = "gray40"), legend.text = element_text(size = 8),
        legend.key.size = unit(0.2, "cm")) 

Static version

This ‘static’ plot is once again more accurate visually to the 538 figure but it is not interactive.

print(p_fig2)

Interactive version

This plot uses plotly to create an interactive figure, but some of the js code will not handle certain aesthetic features (such as horizontal legend layout) so it is not quite accurate to the 538 layout

# For some reason the `tooltip` ordering argument is not working as it lists on the vignette
ggplotly(p_fig2, tooltip = c("color", "x", "y"))
## Warning: plotly.js does not (yet) support horizontal legend items 
## You can track progress here: 
## https://github.com/plotly/plotly.js/issues/53